Distributed Cybersecurity Vulnerability Search Engine

A Hadoop-based vulnerability search engine that builds a distributed TF-IDF index over NVD CVE records and provides an interactive Java Swing interface for ranked CVE search.

The system uses Hadoop HDFS for distributed storage, MapReduce for offline indexing, and an in-memory Java search layer for interactive query serving. It was designed around NVD CVE data normalized into JSON Lines, where each line is one independent CVE document.

Highlights

Distributed storage of CVE JSONL datasets in HDFS
MapReduce pipeline for tokenization, inverted index construction, and TF-IDF scoring
Java Swing GUI with HDFS file management, MapReduce job monitoring, and CVE search
AND/OR query mode, Top-N search, Show All Results, sortable result columns, and detailed raw JSON view
Hadoop configuration and preprocessing scripts included for reproducibility

User Interface

Screenshots of the Search Interface and the CVE Detail View are available under screenshots/.

Repository Structure

src/                 Java source code for MapReduce jobs, CLI search, and Swing GUI
preprocessing/       Python scripts for converting NVD CVE JSON files to JSONL datasets
hadoop-config/       Hadoop XML configuration files used in the 2-node VM cluster
dist/                Built project JAR
screenshots/         GUI, architecture, HDFS UI, and YARN UI screenshots
datasets/            Dataset documentation only; large JSONL files are gitignored
raw-nvd-json/        Raw NVD JSON documentation only; raw JSON files are gitignored

The technical report is included as BLM4821_CVE_Engine_Report.pdf.

Data Source and Dataset Variants

The project uses NVD CVE data. Raw CVE JSON files were obtained from the community-maintained fkie-cad/nvd-json-data-feeds repository, which mirrors NVD data and reconstructs the NVD JSON data feed packages from it:

https://github.com/fkie-cad/nvd-json-data-feeds

The raw JSON files were normalized into JSON Lines format. One JSONL line corresponds to one CVE record and one searchable document. The implementation was tested on CVE data from 2018-2025, but the preprocessing and indexing pipeline can be applied to other NVD year ranges as well.
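The normalization step can be illustrated with a minimal Python sketch. It is not the script shipped under preprocessing/. The top-level `cve_items` key, the `descriptions` layout, the output field names, and the placeholder CVE IDs are all assumptions made for illustration; adjust them to the structure of the raw files you actually download.

```python
import json

def to_jsonl_lines(feed):
    """Yield one compact JSON string per CVE record in a parsed feed dict."""
    # Assumption: records live under a top-level "cve_items" list.
    for item in feed.get("cve_items", []):
        # Assumption: descriptions are a list of {"lang", "value"} objects.
        description = next(
            (d.get("value", "") for d in item.get("descriptions", [])
             if d.get("lang") == "en"),
            "",
        )
        record = {"id": item.get("id"), "description": description}
        # json.dumps escapes newlines inside strings, so each record
        # is guaranteed to occupy exactly one output line.
        yield json.dumps(record, ensure_ascii=False)

# Made-up sample input; the IDs are placeholders, not real CVEs.
sample_feed = {
    "cve_items": [
        {"id": "CVE-0000-0001",
         "descriptions": [{"lang": "en",
                           "value": "Buffer overflow in\nexample parser."}]},
        {"id": "CVE-0000-0002",
         "descriptions": [{"lang": "es", "value": "Ejemplo."}]},
    ]
}
lines = list(to_jsonl_lines(sample_feed))
```

Writing `"\n".join(lines)` to a file would produce a JSONL dataset in which each line can be parsed independently, which is what lets a line-oriented MapReduce input format treat one line as one document.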

Dataset	Size	Document count	HDFS raw path
Compact	107.5 MB	222,083	`/raw/cve_2018_2025_compact.jsonl`
Enriched	283.1 MB	222,083	`/raw/cve_2018_2025_enriched.jsonl`
Large	523.6 MB	222,083	`/raw/cve_2018_2025_500mb.jsonl`

Large raw files are intentionally excluded from GitHub. The data files under the local datasets/ and raw-nvd-json/ directories are gitignored because they range from hundreds of megabytes to more than one gigabyte; only documentation is kept in those directories. Use the scripts under preprocessing/ to recreate the JSONL datasets from raw NVD JSON files.

Hadoop Cluster Environment

Host: Windows laptop running VirtualBox
Guest OS: Ubuntu Server 22.04
Hadoop: 3.4.1
Java: 11
Cluster layout:
- hadoopmaster / 192.168.56.101: NameNode, ResourceManager, DataNode, NodeManager
- hadoop-worker1 / 192.168.56.102: DataNode, NodeManager

Monitoring interfaces used during development:

HDFS NameNode UI: http://192.168.56.101:9870
YARN ResourceManager UI: http://192.168.56.101:8088/cluster

Hadoop Configuration

The hadoop-config/ directory contains the cluster configuration used for this project:

core-site.xml: default filesystem, e.g. hdfs://hadoopmaster:9000
hdfs-site.xml: HDFS directories, replication, NameNode/DataNode settings
mapred-site.xml: MapReduce configured to run on YARN
yarn-site.xml: ResourceManager and NodeManager addresses
hadoop-env.sh: Hadoop environment variables such as JAVA_HOME
workers: worker node list

Configuration files were prepared on the master node and copied to the worker node using scp.

MapReduce Pipeline

Job 1: CVE Tokenizer

Reads raw JSONL CVE records, extracts relevant fields, normalizes text, and emits one tokenized document per CVE.

hadoop jar dist/cve-search.jar com.cvesearch.CveTokenizerJob \
  /raw/cve_2018_2025_compact.jsonl /tokens/compact
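
As a rough illustration of what the tokenizer does per record, here is a Python sketch. The project's actual job is written in Java and its exact field selection and normalization rules are not reproduced here; the `description` field, the lowercase-alphanumeric token rule, and the placeholder CVE ID are assumptions.

```python
import json
import re

# Assumed normalization: lowercase, then keep runs of letters and digits.
TOKEN_RE = re.compile(r"[a-z0-9]+")

def tokenize_record(jsonl_line, fields=("description",)):
    """Return (cve_id, tokens) for one JSONL line."""
    record = json.loads(jsonl_line)
    # Concatenate the selected text fields; missing fields become "".
    text = " ".join(str(record.get(f, "")) for f in fields)
    tokens = TOKEN_RE.findall(text.lower())
    return record["id"], tokens

# Made-up record with a placeholder ID.
line = '{"id": "CVE-0000-0001", "description": "Buffer overflow in Example-Parser 2.0"}'
doc_id, tokens = tokenize_record(line)
```

With this rule, punctuation such as the hyphen and the dot acts as a separator, so `Example-Parser 2.0` becomes four tokens. A real tokenizer might also drop stop words or keep version strings intact.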

Job 2: Inverted Index

Builds posting lists from tokenized documents.

hadoop jar dist/cve-search.jar com.cvesearch.InvertedIndexJob \
  /tokens/compact /index/compact

Output format:

term -> CVE:tf,CVE:tf,...
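
The posting-list format above can be sketched in a few lines of Python. This simulates the map and reduce phases in a single process and is not the project's Java job. It assumes that tf is the raw count of a term within one document; the helper names and placeholder CVE IDs are hypothetical.

```python
from collections import Counter, defaultdict

def build_postings(tokenized_docs):
    """Build term -> {doc_id: tf} from (doc_id, tokens) pairs."""
    postings = defaultdict(dict)
    for doc_id, tokens in tokenized_docs:
        # "Map" side: count each term inside a single document.
        for term, tf in Counter(tokens).items():
            # "Reduce" side: group the per-document counts by term.
            postings[term][doc_id] = tf
    return postings

def format_posting(term, docs):
    """Render one posting list as 'term -> CVE:tf,CVE:tf,...'."""
    pairs = ",".join(f"{doc_id}:{tf}" for doc_id, tf in sorted(docs.items()))
    return f"{term} -> {pairs}"

# Made-up tokenized documents with placeholder IDs.
docs = [
    ("CVE-0000-0001", ["buffer", "overflow", "buffer"]),
    ("CVE-0000-0002", ["overflow", "parser"]),
]
postings = build_postings(docs)
index_lines = [format_posting(t, postings[t]) for t in sorted(postings)]
```

For the two sample documents, `index_lines` contains one line per term, for example `overflow -> CVE-0000-0001:1,CVE-0000-0002:1`.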

Job 3: TF-IDF Index

Computes TF-IDF scores for each term-document pair.

hadoop jar dist/cve-search.jar com.cvesearch.TfIdfJob \
  /index/compact /tfidf/compact 222083

Formula:

TF-IDF = TF * log(N / DF)

where N is the document count and DF is the number of documents containing the term.
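
A small worked example of the formula, written as a Python sketch rather than the project's Java job. The base of the logarithm is not stated above, so the natural logarithm is an assumption here; the sample postings and CVE IDs are made up.

```python
import math

def tf_idf(postings, n_docs):
    """Score every term-document pair as TF * log(N / DF)."""
    scores = {}
    for term, docs in postings.items():
        df = len(docs)                  # documents containing the term
        idf = math.log(n_docs / df)     # assumed natural logarithm
        scores[term] = {doc_id: tf * idf for doc_id, tf in docs.items()}
    return scores

# Two-document toy corpus: term -> {doc_id: tf}.
postings = {
    "buffer": {"CVE-0000-0001": 2},
    "overflow": {"CVE-0000-0001": 1, "CVE-0000-0002": 1},
}
scores = tf_idf(postings, n_docs=2)
```

Here `buffer` scores 2 * log(2/1) in its only document, while `overflow` appears in both documents and therefore scores 1 * log(2/2) = 0. A term that occurs in every document carries no ranking weight under this formula.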

Running the GUI

The GUI loads the TF-IDF index and raw CVE JSONL records from HDFS into memory. Use a larger heap for enriched or large datasets.

java -Xmx2g -cp "dist/cve-search.jar:$(hadoop classpath)" com.cvesearch.CveSearchGUI

For the large dataset:

java -Xmx2500m -cp "dist/cve-search.jar:$(hadoop classpath)" com.cvesearch.CveSearchGUI

GUI modules:

HDFS File Manager: browse HDFS, upload VM file to HDFS, download HDFS file to VM, delete, refresh
MapReduce Job Monitor: run Tokenizer, Inverted Index, TF-IDF, or Full Pipeline
Search Interface: load index, search CVEs, sort results, inspect detailed CVE records

Search Model

Search does not run MapReduce. MapReduce is used only for offline index construction. At query time, the GUI reads the precomputed TF-IDF index from HDFS into memory, combines posting lists using AND/OR logic, ranks CVEs by accumulated TF-IDF score, and displays Top-N or all matches.
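
The query-time steps described above can be sketched as follows. This is a Python illustration under stated assumptions, not the GUI's Java implementation: the in-memory index is modeled as a dict of term -> {CVE id: TF-IDF score}, ties are broken by CVE id, and the function name, scores, and CVE IDs are hypothetical.

```python
from collections import defaultdict

def search(index, query_terms, mode="AND", top_n=None):
    """Combine posting lists and rank documents by summed TF-IDF score."""
    posting_lists = [index.get(term, {}) for term in query_terms]
    if not posting_lists:
        return []
    doc_sets = [set(p) for p in posting_lists]
    # AND keeps documents present in every list; OR keeps any match.
    if mode == "AND":
        matches = set.intersection(*doc_sets)
    else:
        matches = set.union(*doc_sets)
    # Accumulate the score of each matching document across query terms.
    totals = defaultdict(float)
    for postings in posting_lists:
        for doc_id, score in postings.items():
            if doc_id in matches:
                totals[doc_id] += score
    # Highest score first; CVE id as a deterministic tie-breaker.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked if top_n is None else ranked[:top_n]

# Made-up index with placeholder IDs and scores.
index = {
    "apache": {"CVE-0000-0001": 1.5, "CVE-0000-0002": 0.5},
    "struts": {"CVE-0000-0001": 2.0, "CVE-0000-0003": 1.0},
}
and_hits = search(index, ["apache", "struts"], mode="AND")
or_hits = search(index, ["apache", "struts"], mode="OR", top_n=2)
```

Passing `top_n=None` corresponds to Show All Results, while a number such as 50 corresponds to Top-N. Because the work is limited to dictionary lookups and a sort over the matching documents, no cluster job is needed at query time.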

Example search results:

Dataset	Query	Mode	Display	Matches	Shown	Query time
Compact	`apache`	AND	Show All	1910	1910	0.591 s
Large	`apache`	AND	Top-N 50	1934	50	0.571 s

Measured MapReduce Execution Times

Dataset	Tokenizer	Inverted Index	TF-IDF	Total
Compact 107.5 MB	5m58.959s	11m29.182s	3m29.980s	20m58.121s
Enriched 283.1 MB	6m07.060s	9m18.300s	4m26.863s	19m52.223s
Large 523.6 MB	5m57.599s	9m08.788s	2m45.042s	17m51.429s

Total runtime does not increase with raw dataset size; in these runs the large dataset actually finished fastest. On this small virtualized Hadoop cluster, variation from HDFS block placement, input splits, disk cache state, JVM warm-up, VM load at the time of the run, YARN scheduling overhead, and output characteristics can outweigh the effect of input size.

Notes on Large Files

GitHub has practical file and repository size limits. Raw NVD JSON files and generated JSONL datasets are excluded from version control. The repository keeps the source code, preprocessing scripts, Hadoop configuration, screenshots, built JAR, and report while documenting how to recreate the data locally.
